
Use GPUArrays accumulation implementation #2813


Open · wants to merge 3 commits into master

Conversation

@christiangnrd (Member)

Opened to run benchmarks.

Todo:

  • Add a compat bound when the GPUArrays version is released

github-actions bot (Contributor) commented Jul 20, 2025

Your PR requires formatting changes to meet the project's style guidelines.
Please consider running Runic (git runic master) to apply these changes.

Suggested changes:
diff --git a/perf/array.jl b/perf/array.jl
index 3dbab9816..400de2231 100644
--- a/perf/array.jl
+++ b/perf/array.jl
@@ -54,11 +54,11 @@ let group = addgroup!(group, "reverse")
     group["1d"] = @async_benchmarkable reverse($gpu_vec)
     group["1dL"] = @async_benchmarkable reverse($gpu_vec_long)
     group["2d"] = @async_benchmarkable reverse($gpu_mat; dims=1)
-    group["2dL"] = @async_benchmarkable reverse($gpu_mat_long; dims=1)
+    group["2dL"] = @async_benchmarkable reverse($gpu_mat_long; dims = 1)
     group["1d_inplace"] = @async_benchmarkable reverse!($gpu_vec)
     group["1dL_inplace"] = @async_benchmarkable reverse!($gpu_vec_long)
     group["2d_inplace"] = @async_benchmarkable reverse!($gpu_mat; dims=1)
-    group["2dL_inplace"] = @async_benchmarkable reverse!($gpu_mat_long; dims=2)
+    group["2dL_inplace"] = @async_benchmarkable reverse!($gpu_mat_long; dims = 2)
 end
 
 group["broadcast"] = @async_benchmarkable $gpu_mat .= 0f0
diff --git a/test/runtests.jl b/test/runtests.jl
index b6c479cce..89bf840c9 100644
--- a/test/runtests.jl
+++ b/test/runtests.jl
@@ -5,7 +5,7 @@ using Printf: @sprintf
 using Base.Filesystem: path_separator
 
 using Pkg
-Pkg.add(url="https://github.com/christiangnrd/GPUArrays.jl", rev="accumulatetests")
+Pkg.add(url = "https://github.com/christiangnrd/GPUArrays.jl", rev = "accumulatetests")
 
 # parse some command-line arguments
 function extract_flag!(args, flag, default=nothing; typ=typeof(default))

github-actions bot (Contributor) left a comment

CUDA.jl Benchmarks

Benchmark suite Current: 3c02fa9 Previous: 205c238 Ratio
latency/precompile 43098463154.5 ns 42934926801 ns 1.00
latency/ttfp 7012905021 ns 7008552789 ns 1.00
latency/import 3574306668 ns 3569139582 ns 1.00
integration/volumerhs 9610435 ns 9606581 ns 1.00
integration/byval/slices=1 147160 ns 147311 ns 1.00
integration/byval/slices=3 426070 ns 426127 ns 1.00
integration/byval/reference 145095 ns 145282 ns 1.00
integration/byval/slices=2 286522 ns 286537 ns 1.00
integration/cudadevrt 103592 ns 103674 ns 1.00
kernel/indexing 14293 ns 14638.5 ns 0.98
kernel/indexing_checked 14958 ns 15045 ns 0.99
kernel/occupancy 720.3851351351351 ns 669.9465408805031 ns 1.08
kernel/launch 2162.222222222222 ns 2202.4444444444443 ns 0.98
kernel/rand 18437 ns 17466 ns 1.06
array/reverse/1d 20190 ns 20143 ns 1.00
array/reverse/2d 23777 ns 24692 ns 0.96
array/reverse/1d_inplace 10893 ns 11332 ns 0.96
array/reverse/2d_inplace 13309 ns 13662 ns 0.97
array/copy 21111 ns 21281 ns 0.99
array/iteration/findall/int 118061 ns 159966.5 ns 0.74
array/iteration/findall/bool 98917 ns 141602 ns 0.70
array/iteration/findfirst/int 158577.5 ns 163419 ns 0.97
array/iteration/findfirst/bool 159266.5 ns 165377 ns 0.96
array/iteration/scalar 73974 ns 76152 ns 0.97
array/iteration/logical 175055 ns 219912.5 ns 0.80
array/iteration/findmin/1d 47340 ns 47580 ns 0.99
array/iteration/findmin/2d 96420 ns 97060 ns 0.99
array/reductions/reduce/Int64/1d 46877 ns 43742.5 ns 1.07
array/reductions/reduce/Int64/dims=1 53196 ns 47519.5 ns 1.12
array/reductions/reduce/Int64/dims=2 62497.5 ns 62503 ns 1.00
array/reductions/reduce/Int64/dims=1L 89099 ns 89134 ns 1.00
array/reductions/reduce/Int64/dims=2L 90082.5 ns 87634.5 ns 1.03
array/reductions/reduce/Float32/1d 34719 ns 35637 ns 0.97
array/reductions/reduce/Float32/dims=1 51741 ns 51967.5 ns 1.00
array/reductions/reduce/Float32/dims=2 59582 ns 59824 ns 1.00
array/reductions/reduce/Float32/dims=1L 52550 ns 52680 ns 1.00
array/reductions/reduce/Float32/dims=2L 70184 ns 70568 ns 0.99
array/reductions/mapreduce/Int64/1d 46053 ns 43514 ns 1.06
array/reductions/mapreduce/Int64/dims=1 53641.5 ns 46605.5 ns 1.15
array/reductions/mapreduce/Int64/dims=2 63304.5 ns 62143.5 ns 1.02
array/reductions/mapreduce/Int64/dims=1L 89158 ns 89174 ns 1.00
array/reductions/mapreduce/Int64/dims=2L 87207.5 ns 87305.5 ns 1.00
array/reductions/mapreduce/Float32/1d 34806 ns 35464 ns 0.98
array/reductions/mapreduce/Float32/dims=1 48546 ns 42505.5 ns 1.14
array/reductions/mapreduce/Float32/dims=2 59774 ns 60252 ns 0.99
array/reductions/mapreduce/Float32/dims=1L 52862 ns 52803 ns 1.00
array/reductions/mapreduce/Float32/dims=2L 70614 ns 70795 ns 1.00
array/broadcast 20552 ns 20737 ns 0.99
array/copyto!/gpu_to_gpu 11319 ns 13192 ns 0.86
array/copyto!/cpu_to_gpu 215254.5 ns 217123 ns 0.99
array/copyto!/gpu_to_cpu 283817 ns 287100 ns 0.99
array/accumulate/Int64/1d 80265 ns 126109 ns 0.64
array/accumulate/Int64/dims=1 220793 ns 84201 ns 2.62
array/accumulate/Int64/dims=2 112332 ns 158968 ns 0.71
array/accumulate/Int64/dims=1L 410035 ns 1710638 ns 0.24
array/accumulate/Int64/dims=2L 5155424 ns 967410.5 ns 5.33
array/accumulate/Float32/1d 55731 ns 109994 ns 0.51
array/accumulate/Float32/dims=1 201773 ns 81343 ns 2.48
array/accumulate/Float32/dims=2 92523 ns 148659 ns 0.62
array/accumulate/Float32/dims=1L 245125 ns 1619411 ns 0.15
array/accumulate/Float32/dims=2L 3735231 ns 699433 ns 5.34
array/construct 1260.9 ns 1288.5 ns 0.98
array/random/randn/Float32 47976 ns 45344 ns 1.06
array/random/randn!/Float32 24949 ns 25330 ns 0.98
array/random/rand!/Int64 27300 ns 27554 ns 0.99
array/random/rand!/Float32 8829 ns 8908.333333333334 ns 0.99
array/random/rand/Int64 30165 ns 30218 ns 1.00
array/random/rand/Float32 13153 ns 13361 ns 0.98
array/permutedims/4d 60598.5 ns 60397 ns 1.00
array/permutedims/2d 54811 ns 54394 ns 1.01
array/permutedims/3d 55558 ns 55362 ns 1.00
array/sorting/1d 2760989 ns 2758561 ns 1.00
array/sorting/by 3368803.5 ns 3368461 ns 1.00
array/sorting/2d 1088682 ns 1089562 ns 1.00
cuda/synchronization/stream/auto 1027.6 ns 1066.6 ns 0.96
cuda/synchronization/stream/nonblocking 7564.6 ns 7691.3 ns 0.98
cuda/synchronization/stream/blocking 815.8111111111111 ns 844.0121951219512 ns 0.97
cuda/synchronization/context/auto 1153.8 ns 1211.4 ns 0.95
cuda/synchronization/context/nonblocking 8424.400000000001 ns 6881.1 ns 1.22
cuda/synchronization/context/blocking 894.8888888888889 ns 924.7692307692307 ns 0.97

This comment was automatically generated by a workflow using github-action-benchmark.

@kshyatt added the cuda kernels label (Stuff about writing CUDA kernels.) on Jul 26, 2025
@maleadt (Member) commented Jul 29, 2025

Well, that's a bit all over the place.

[only benchmarks]
@christiangnrd (Member, Author)

> Well, that's a bit all over the place.

Indeed.

By hacking the big mapreduce kernel's heuristic for the by-thread vs. by-block decision into AK, we recover most of the performance discrepancy. It's still a regression, but on a 3090 the Int64/dims=1 ratio is 7.3 for AK/master and 1.7 for (AK with the better heuristic)/master. For Float32/dims=1 the ratios are 6.1 and 1.2, respectively. I still need to figure out what's happening with dims=2L, but at least the performance discrepancy won't be as bad once JuliaGPU/KernelAbstractions.jl#631 is figured out and implemented in AK.
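
For reference, here is a minimal sketch of the kind of shape-based switch being discussed. It is plain Julia with a made-up function name and a made-up saturation threshold, not the actual CUDA.jl or AcceleratedKernels code; the idea is to use the serial per-thread scan only when one thread per slice already saturates the device.

```julia
# Hypothetical illustration only: pick between scanning each slice serially with a
# single thread ("by-thread") or cooperatively with a whole block ("by-block"),
# based purely on the array shape. The saturation threshold is a placeholder.
function accumulate_strategy(sz::Dims, dims::Integer;
                             device_threads = 82 * 1_536)  # e.g. RTX 3090: 82 SMs × 1536 threads/SM
    len = sz[dims]               # length of the scanned dimension
    nslices = prod(sz) ÷ len     # number of independent slices
    # If one thread per slice already saturates the device, the serial per-thread
    # scan wins because it needs no intra-slice synchronization; otherwise a whole
    # block should cooperate on each slice.
    return nslices >= device_threads ? :by_thread : :by_block
end

accumulate_strategy((1024, 1024), 1)     # => :by_block  (only 1024 slices)
accumulate_strategy((16, 1_000_000), 1)  # => :by_thread (10^6 short slices)
```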

@maleadt (Member) commented Jul 30, 2025

> 1.7 for (AK with better heuristic)/master

That's too bad, and very much at odds with the results I've seen presented on AK.jl at e.g. JuliaCon. I guess the reduction kernel wasn't really optimized properly yet (the paper seems to focus on sorting operations).

@christiangnrd (Member, Author)

> I guess the reduction kernel wasn't really optimized properly yet

I suspect the dims=2L case could be mitigated by a better heuristic for choosing the block size, rather than always using 256. That test array has quite a weird shape, and there's a huge improvement with larger block sizes.
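
To illustrate the direction (not the actual AK code), a hedged sketch of deriving the block size from CUDA.jl's occupancy API instead of hard-coding 256; `dummy_kernel!` here is a trivial stand-in for the real accumulate kernel:

```julia
using CUDA

# Trivial stand-in kernel; the real code would be the AK accumulate kernel.
function dummy_kernel!(out, inp)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(inp)
        @inbounds out[i] = inp[i]
    end
    return
end

# Launch with an occupancy-derived block size instead of a hard-coded 256.
function launch_with_occupancy!(out, inp)
    kernel = @cuda launch=false dummy_kernel!(out, inp)
    config = launch_configuration(kernel.fun)    # suggested threads per block for this kernel
    threads = min(length(inp), config.threads)
    blocks = cld(length(inp), threads)
    kernel(out, inp; threads, blocks)
end

# Usage (requires a CUDA device):
# a = CUDA.rand(Float32, 2^20); b = similar(a); launch_with_occupancy!(b, a)
```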

@kshyatt (Member) commented Aug 4, 2025

@christiangnrd do you think it's worthwhile to open an issue at AK.jl about this (if one isn't open already)?

@christiangnrd (Member, Author)

> @christiangnrd do you think it's worthwhile to open an issue at AK.jl about this (if one isn't open already)?

Good idea. I opened #60.

@anicusan (Member)

Do I read this correctly?

For accumulate:

  • 1d is faster (0.51 / 0.64).
  • Nd is faster for one dim but slower for the other; this might be something to improve in how we switch between the by-thread and by-block algorithms.

In the other PR (#2815) for mapreduce:

  • 1d is comparable.
  • Nd is much slower for the L cases; otherwise it is faster for one dim but slower for the other.

This is the same trend as the timings I posted when first implementing N-dimensional reductions (JuliaGPU/AcceleratedKernels.jl#6 (comment)); AK-0.1 didn't have dims :)

@christiangnrd is right, we should definitely improve the heuristic for switching between the by-thread and by-block algorithms. For the innermost reduction kernel, though, the CUDA.jl algorithm should be superior, and until we have warp sizes and shuffle instructions exposed in KernelAbstractions I don't think we can do much better (implementation and notes here). What is better (and original, afaik) in the AK mapreduce is that it does not do recursive memory allocations when multiple kernel launches are needed (it switches views into different ends of the same vector here), so memory consumption is bounded and known upfront.
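
To make the bounded-memory point concrete, here is a small CPU-only sketch of that ping-pong pattern; `tree_reduce`, the serial inner loop, and the scratch sizing are illustrative stand-ins, not the AK implementation:

```julia
# Illustrative only: a tree reduction whose scratch memory is allocated once
# and reused by ping-ponging between the two halves of a single buffer.
function tree_reduce(op, xs::Vector{T}; block = 4) where {T}
    @assert !isempty(xs)
    nblocks = cld(length(xs), block)
    scratch = Vector{T}(undef, 2 * nblocks)   # total scratch known upfront
    src, n, pass = xs, length(xs), 0
    while n > 1
        m = cld(n, block)
        # Alternate halves so we never write into the half currently being read.
        half = isodd(pass) ? view(scratch, (nblocks + 1):(2 * nblocks)) :
                             view(scratch, 1:nblocks)
        dst = view(half, 1:m)
        for b in 1:m                          # stand-in for one kernel launch
            lo, hi = (b - 1) * block + 1, min(b * block, n)
            dst[b] = reduce(op, view(src, lo:hi))
        end
        src, n, pass = dst, m, pass + 1
    end
    return src[1]
end

tree_reduce(+, collect(1:1000)) == sum(1:1000)   # true
```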

I'll use the L cases for Nd mapreduce to investigate bottlenecks...

Labels: cuda kernels (Stuff about writing CUDA kernels.)
Projects: None yet
4 participants